[Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines#2342
[Fix][Pipeline] Prevent double expansion of shared buffers across sibling pipelines#2342harelhuang wants to merge 3 commits into
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
2fa543f to
5e9d81a
Compare
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughDetects and reconcile sibling-expanded shared buffers during pipeline injection: introduces a BufferReplacer, records original→old-expanded mappings when encountering previously expanded sibling buffers, rewrites pipelines to use consistent expanded buffers, and applies replacements at block scope. ChangesBuffer reconciliation for sibling pipelined loops
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/transform/inject_pipeline.cc`:
- Around line 754-767: old_expanded_to_new_ currently accumulates pairwise
mappings and can create transitive chains (e.g., A->B then B->C) so a single
BufferReplacer call leaves intermediate buffers in the block; when building
mappings from old_expanded_buffers_ and rewrite_result.buffer_remap you must
collapse/compose chains to the final target before calling BufferReplacer: for
each mapping created from old_expanded to new_expanded, follow any existing
mapping in old_expanded_to_new_ (and any new mappings from
rewrite_result.buffer_remap) until you reach the final buffer and store
old_expanded -> final_new_expanded (or alternatively clear/rebuild
old_expanded_to_new_ at the start of the parent loop and only insert the latest
mappings); apply the same composition/clear logic at the other occurrence
(around the 1084-1088 region) so BufferReplacer receives only direct final
mappings.
- Around line 45-82: BufferReplacer currently only rewrites
BufferLoad/BufferStore/DeclBuffer/AllocBuffer but leaves SBlock metadata
(alloc_buffers, reads, writes) pointing to old Buffer objects, causing mixed
Buffer instances for the same data; update BufferReplacer to also handle
SBlock-like nodes by adding/overriding VisitStmt_ for the node type that carries
block metadata (the SBlock container used in this IR) and, when
replacements_.Get(old_buf) is true, replace entries in alloc_buffers, reads, and
writes with the new Buffer objects and reconstruct the SBlock node (similar to
the approach in flatten_buffer.cc) so nested blocks are fully reconciled across
the subtree.
- Around line 51-63: When replacing BufferLoad/BufferStore in VisitExpr_(const
BufferLoadNode *op) and VisitStmt_(const BufferStoreNode *op) via
replacements_.Get, preserve and recursively mutate child nodes instead of
rebuilding from op with empty predicate/indices; specifically, for
BufferLoadNode use new_buf.value() but pass visited indices (visit each
op->indices via VisitExpr) and preserve/visit op->predicate (use VisitExpr on
predicate if present) and for BufferStoreNode pass visited indices (visit each
index) as well as the already-visited value; keep the original span. If not
replacing, continue to call the StmtExprMutator fallbacks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f6d02361-323f-4e2e-bb9a-0683d5637127
📒 Files selected for processing (1)
src/transform/inject_pipeline.cc
…CUDA) When two T.Pipelined loops share the same alloc_shared buffer but use different num_stages (e.g. 2 and 4), the first pipeline expands the buffer from (M,K) to (2,M,K). The second pipeline then tries to expand the already-3D buffer again, producing a 4D buffer that crashes LayoutInference with a shape mismatch ICHECK. This fix extends the original PR tile-ai#2337 to work on CUDA backend: 1. Detect already-expanded buffers in pipeline_allocs by matching against pending_buffer_remap_ entries via their data Vars (not by reverse Buffer object lookup, which fails on CUDA path). 2. Replace already-expanded 3D buffers with their original 2D form so RewritePipeline creates a fresh expansion for the second pipeline. 3. After RewritePipeline, reconcile old and new expanded Buffer objects across the entire block body via a BufferReplacer (handles BufferLoad, BufferStore, DeclBuffer, AllocBuffer), so LayoutInference sees a single consistent buffer object per Var. Tested on CUDA (NVIDIA L20) with asymmetric num_stages (2 and 4), and on the reproducer from issue tile-ai#2309. No regression on standard single-pipeline GEMM. Fixes tile-ai#2309. Closes tile-ai#2337.
5e9d81a to
a27d91c
Compare
|
@harelhuang Thanks, would you mind providing a related test or a script to help us reproduce and review? |
|
Sure! Here's a minimal reproducer that crashes without the fix on CUDA: import tilelang
import tilelang.language as T
M, N, K = 512, 512, 512
BM, BN, BK = 64, 64, 32
@tilelang.jit
def asymmetric_stages(A, B):
A: T.Tensor((M, K), T.float16)
B: T.Tensor((K, N), T.float16)
C = T.empty((M, N), T.float16)
K_half = K // 2
with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
A_shared = T.alloc_shared((BM, BK), T.float16)
B_shared = T.alloc_shared((BK, BN), T.float16)
C_local = T.alloc_fragment((BM, BN), T.float32)
T.clear(C_local)
for ko in T.Pipelined(T.ceildiv(K_half, BK), num_stages=2):
T.copy(A[by * BM, ko * BK], A_shared)
T.copy(B[ko * BK, bx * BN], B_shared)
T.gemm(A_shared, B_shared, C_local)
for ko in T.Pipelined(T.ceildiv(K_half, BK), num_stages=4):
T.copy(A[by * BM, K_half + ko * BK], A_shared)
T.copy(B[K_half + ko * BK, bx * BN], B_shared)
T.gemm(A_shared, B_shared, C_local)
T.copy(C_local, C[by * BM, bx * BN])
return C
kernel = asymmetric_stages.compile(M=M, N=N, K=K)Without the fix, this crashes with: The two Tested on CUDA (NVIDIA L20) — compiles and runs correctly with the fix. |
…ling pipelines (CUDA) When two T.Pipelined loops share the same alloc_shared buffer but use different num_stages (e.g. 2 and 4), the first pipeline expands the buffer from (M,K) to (2,M,K). The second pipeline then tries to expand the already-3D buffer again, producing a 4D buffer that crashes LayoutInference with a shape mismatch ICHECK. Fix: 1. Detect already-expanded buffers in pipeline_allocs by matching against pending_buffer_remap_ entries via their data Vars. 2. Replace already-expanded 3D buffers with their original 2D form so RewritePipeline creates a fresh expansion. 3. Reconcile old/new Buffer objects across the entire block body via BufferReplacer (handles BufferLoad/BufferStore/DeclBuffer/AllocBuffer and SBlock metadata). 4. Compose replacement chains to handle 3+ sibling pipelines sharing the same buffer. Fixes tile-ai#2309. Closes tile-ai#2337. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Hi @LeiWang1999, could you re-run the failed CI job? The only failure is Our regression test
All other checks (Quick Lint, pre-commit.ci, CodeRabbit) are green. |
|
@LeiWang1999 Confirmed this is a pre-existing flaky test — the same Our change only modifies Could you re-run the failed CUDA job? |
|
Correction: the bump-version PR is #2354 (not the branch name). It had the same |
Summary
T.Pipelinedloops share the samealloc_sharedbuffer but use differentnum_stagesRoot Cause
When two
T.Pipelinedloops shareA_shared/B_sharedbuffers with differentnum_stages(e.g. 2 and 4):A_sharedfrom(64,32)to(2,64,32)viaRewriteAllocBufferCollectUsedPipelineBufferspicks up the already-expanded(2,64,32)bufferRewriteAllocBufferinserts another dimension, producing(4,2,64,32)— a 4D bufferLayoutInferencecrashes:ICHECKfails comparing(2,64,32)input shape with(4,64,32)expected shapeFix (3 parts)
pipeline_allocsby matching againstpending_buffer_remap_entries via their data Vars (not by reverse Buffer object lookup, which fails on CUDA path)RewritePipelinecreates a fresh expansion for the second pipelineBufferReplacer(handlesBufferLoad,BufferStore,DeclBuffer,AllocBuffer) soLayoutInferencesees a single consistent buffer object per VarTest plan
make -j8builds cleanly🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests